This guide demonstrates how to perform functional clustering analysis on one-dimensional time series data using the MultiConnector package. We use ovarian cancer cell growth data as a case study to illustrate the complete workflow from data import to biological interpretation. The analysis identifies distinct growth patterns and relates them to cellular progeny information, providing insights into cancer cell behavior and potential therapeutic targets.
Functional clustering is a powerful statistical method for analyzing time series data where the goal is to group curves based on their shape and temporal patterns rather than on individual time points. This approach is particularly valuable in biological and medical research, where identifying distinct temporal patterns can reveal important insights about the underlying processes.
MultiConnector implements advanced functional clustering methods based on the Sugar & James model.
In this guide, we analyze ovarian cancer cell line growth curves to identify distinct growth patterns and relate them to cellular progeny information.
# Load required libraries
library(dplyr) # Data manipulation
library(parallel) # Parallel computing
library(MultiConnector) # Main clustering package
library(ggplot2) # Enhanced plotting
library(knitr) # Table formatting
library(kableExtra) # Enhanced table styling
# Set up parallel processing
n_cores <- detectCores()
workers <- max(1, n_cores - 1) # Leave one core free
cat("System Information:\n")
#> System Information:
#> - Available CPU cores: 8
#> - Cores used for analysis: 7
For one-dimensional functional clustering, your data should include:

- a time series file with `subjID`, `measureID`, `time`, and `value` columns
- an annotation file with `subjID` and feature columns (e.g., treatment groups, demographics)

We begin by loading the ovarian cancer cell growth data, which consists of a time series spreadsheet and an annotation file.
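As an illustration, the expected long format can be sketched with a toy data frame (the IDs and values below are invented, not taken from the ovarian dataset):

```r
# Toy example of the long format expected for the time series file:
# one row per observation, one curve per subjID (values are invented)
growth <- data.frame(
  subjID    = rep(c("S1", "S2"), each = 3),    # curve identifiers
  measureID = "Ovarian",                       # measured quantity
  time      = rep(c(0, 24, 48), times = 2),    # measurement times
  value     = c(1.0, 1.8, 3.1, 1.1, 1.5, 2.2) # observed growth values
)
str(growth)
```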
The ConnectorData() function validates and structures the data for analysis:
time_series_path <- system.file("Data/OvarianCancer/Ovarian_TimeSeries.xlsx", package = "MultiConnector")
annotations_path <- system.file("Data/OvarianCancer/Ovarian_Annotations.txt", package = "MultiConnector")
# Create the main data object
Data <- ConnectorData(time_series_path, annotations_path)
#> ###############################
#> ######## Summary ##############
#>
#> Number of curves:# A tibble: 1 × 1
#> nTimePoints
#> <int>
#> 1 21
#> ;
#> Min curve length: # A tibble: 1 × 1
#> nTimePoints
#> <int>
#> 1 7
#> ; Max curve length: # A tibble: 1 × 1
#> nTimePoints
#> <int>
#> 1 18
#> ###############################
Understanding your data structure is crucial before clustering. We examine the curves both overall and colored by the progeny feature:
# Plot 1: Basic time series overview
p1 <- plot(Data) +
ggtitle("A) All Growth Curves") +
theme_minimal()
# Plot 2: Colored by progeny feature
p2 <- plot(Data, feature = "Progeny") +
ggtitle("B) Curves by Progeny Type") +
theme_minimal()
# Combine plots if possible (requires gridExtra or patchwork)
if (requireNamespace("gridExtra", quietly = TRUE)) {
gridExtra::grid.arrange(p1, p2, ncol = 2)
} else {
print(p1)
print(p2)
}
Initial data exploration: (A) All growth curves overlaid, (B) Curves colored by progeny type.
Time point distribution analysis showing data density across the measurement period.
From the initial exploration, we can see that the curves differ in length and in how densely they are sampled over time.
Many time series datasets have sparse data at later time points. Truncation can improve clustering stability by focusing on well-sampled regions.
Truncation analysis helping to identify optimal cutoff time for maintaining data quality.
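The idea behind such a truncation analysis can be sketched in base R: count how many curves are still observed at each time point and look for where the counts drop off. The data frame below is a toy example, not the ovarian dataset:

```r
# Toy long-format data: three curves observed up to different times
growth <- data.frame(
  subjID = rep(c("S1", "S2", "S3"), times = c(4, 3, 2)),
  time   = c(0, 24, 48, 72,  0, 24, 48,  0, 24)
)
# Number of curves observed at each time point; sparse late time
# points suggest placing the truncation cutoff before them
obs_per_time <- aggregate(subjID ~ time, data = growth, FUN = length)
names(obs_per_time)[2] <- "nCurves"
obs_per_time  # only 1 of the 3 toy curves reaches time 72
```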
# Apply truncation based on analysis
DataTrunc <- truncate(Data, measure = "Ovarian", truncTime = 70)
#> ###############################
#> ######## Summary ##############
#>
#> Number of curves:# A tibble: 1 × 1
#> nTimePoints
#> <int>
#> 1 21
#> ;
#> Min curve length: # A tibble: 1 × 1
#> nTimePoints
#> <int>
#> 1 7
#> ; Max curve length: # A tibble: 1 × 1
#> nTimePoints
#> <int>
#> 1 16
#> ###############################
# Visualize truncated data
plot(DataTrunc) +
ggtitle("Growth Curves After Truncation (t ≤ 70)") +
theme_minimal()
Data after truncation at time = 70, showing improved data density.
The spline dimension (p) parameter controls curve flexibility:
# Estimate optimal spline dimension
CrossLogLikePlot <- estimatepDimension(DataTrunc, p = 2:6, cores = workers)
# Display results
print(CrossLogLikePlot)
Cross-validation results for spline dimension selection showing the optimal p value.
# Set optimal value (typically where CV error is minimized)
optimal_p <- 3
cat("Selected optimal p =", optimal_p, "\n")
#> Selected optimal p = 3
| p Value | Characteristics | Best For |
|---|---|---|
| 2-3 | Smooth, simple curves | Linear/quadratic patterns |
| 4-5 | Moderate flexibility | Complex but stable patterns |
| 6+ | High flexibility | Complex curves (risk overfitting) |
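The effect of p can be illustrated with the splines package that ships with base R: p corresponds to the number of B-spline basis functions used to represent each curve, so larger p permits wigglier fits. This sketch is only illustrative and independent of MultiConnector's internals:

```r
library(splines)  # ships with base R

t <- seq(0, 70, length.out = 50)  # time grid spanning the truncated range
B3 <- bs(t, df = 3)  # smooth basis, analogous to p = 3
B6 <- bs(t, df = 6)  # more flexible basis, analogous to p = 6
ncol(B3)  # 3 basis functions
ncol(B6)  # 6 basis functions
matplot(t, B6, type = "l", ylab = "basis value")  # visualize the basis
```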
We test multiple cluster numbers to find the optimal solution:
# Perform clustering analysis
# Note: This is computationally intensive
clusters <- estimateCluster(
DataTrunc,
G = 2:6, # Test 2-6 clusters
p = optimal_p, # Use optimal spline dimension
runs = 20, # Reduced for demonstration (use 100+ for final analysis)
cores = workers # Parallel processing
)
#> [1] "Total time: 11.44 secs"
Clustering quality metrics across different numbers of clusters (G).
Based on quality metrics, we select the optimal configuration:
# Select optimal clustering (G=4 based on quality metrics)
ClusterData <- selectCluster(clusters, G = 4, "MinfDB")
This selects the configuration with G = 4 clusters according to the MinfDB criterion.
# Plot clusters
p1 <- plot(ClusterData) +
ggtitle("A) Clusters by Assignment") +
theme_minimal()
# Plot by progeny feature
p2 <- plot(ClusterData, feature = "Progeny") +
ggtitle("B) Clusters by Progeny Type") +
theme_minimal()
# Display plots
if (requireNamespace("gridExtra", quietly = TRUE)) {
gridExtra::grid.arrange(p1, p2, ncol = 2)
} else {
print(p1)
print(p2)
}
Cluster visualization: (A) Growth curves colored by cluster assignment, (B) Curves colored by progeny type to examine biological associations.
# Examine cluster-annotation relationships
annotations_summary <- getAnnotations(ClusterData)
print(annotations_summary)
#> [1] "IDSample"     "Progeny"      "Source"       "Real.Progeny"
# Create summary table if annotations exist
if (exists("annotations_summary") && length(annotations_summary) > 0) {
kable(annotations_summary,
caption = "Cluster-annotation summary showing the distribution of features across clusters.") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
}
| x |
|---|
| IDSample |
| Progeny |
| Source |
| Real.Progeny |
Comprehensive validation ensures clustering reliability:
# Perform validation analysis
Metrics <- validateCluster(ClusterData)
# Display validation plots
print(Metrics$plot)
Cluster validation metrics: (A) Silhouette analysis showing how well samples fit their clusters, (B) Entropy analysis measuring assignment uncertainty.
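To make the silhouette metric concrete, here is a minimal sketch using the cluster package (a recommended package in every standard R installation) on toy two-cluster data; widths near 1 indicate points that sit well inside their assigned cluster:

```r
library(cluster)  # recommended package bundled with R

set.seed(2)
# Two well-separated toy clusters in 2D
X  <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
            matrix(rnorm(20, mean = 5), ncol = 2))
cl <- rep(1:2, each = 10)        # cluster assignments
sil <- silhouette(cl, dist(X))   # silhouette width for each point
mean(sil[, "sil_width"])         # close to 1 for clean clusters
```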
Discriminant plots show clusters in reduced dimensional space:
#> [1] "Percentage of variance explained:"
#> [1] 91.731111 5.665049 2.603840
#> [1] "Sum of first two components: 97.4"
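A percentage-of-variance-explained vector like the one printed above can be computed from any principal component decomposition. A minimal base-R sketch on a toy matrix (unrelated to the ovarian data):

```r
set.seed(1)
X <- matrix(rnorm(60), nrow = 20, ncol = 3)  # toy data, 3 variables
pca <- prcomp(X, scale. = TRUE)
# Each component's share of the total variance, in percent
var_explained <- 100 * pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)  # sums to 100
sum(var_explained[1:2])  # share captured by the first two components
```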
Based on our clustering results, we can identify distinct growth patterns across the four clusters and relate them to cellular progeny information.
The MultiConnector package supports the complete workflow, from data import and truncation through parameter estimation, clustering, validation, and biological interpretation.
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-apple-darwin23.4.0
#> Running under: macOS 15.5
#>
#> Matrix products: default
#> BLAS: /usr/local/Cellar/openblas/0.3.28/lib/libopenblasp-r0.3.28.dylib
#> LAPACK: /usr/local/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Rome
#> tzcode source: internal
#>
#> attached base packages:
#> [1] parallel stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] kableExtra_1.4.0 knitr_1.50 ggplot2_3.5.2 dplyr_1.1.4
#> [5] MultiConnector_1.0
#>
#> loaded via a namespace (and not attached):
#> [1] gridExtra_2.3 remotes_2.5.0 readxl_1.4.5
#> [4] rlang_1.1.6 magrittr_2.0.3 compiler_4.4.1
#> [7] roxygen2_7.3.2 systemfonts_1.2.3 vctrs_0.6.5
#> [10] stringr_1.5.1 profvis_0.4.0 pkgconfig_2.0.3
#> [13] crayon_1.5.3 fastmap_1.2.0 MetBrewer_0.2.0
#> [16] ellipsis_0.3.2 magic_1.6-1 labeling_0.4.3
#> [19] promises_1.3.3 rmarkdown_2.29 sessioninfo_1.2.3
#> [22] tzdb_0.5.0 purrr_1.1.0 bit_4.6.0
#> [25] xfun_0.53 cachem_1.1.0 jsonlite_2.0.0
#> [28] gghalves_0.1.4 later_1.4.4 R6_2.6.1
#> [31] bslib_0.9.0 stringi_1.8.7 RColorBrewer_1.1-3
#> [34] rlist_0.4.6.2 pkgload_1.4.0 jquerylib_0.1.4
#> [37] cellranger_1.1.0 Rcpp_1.1.0 usethis_3.1.0
#> [40] readr_2.1.5 httpuv_1.6.16 Matrix_1.7-3
#> [43] splines_4.4.1 tidyselect_1.2.1 rstudioapi_0.17.1
#> [46] dichromat_2.0-0.1 abind_1.4-8 yaml_2.3.10
#> [49] codetools_0.2-20 miniUI_0.1.2 pkgbuild_1.4.8
#> [52] lattice_0.22-7 tibble_3.3.0 shiny_1.11.1
#> [55] withr_3.0.2 evaluate_1.0.4 desc_1.4.3
#> [58] isoband_0.2.7 urlchecker_1.0.1 xml2_1.4.0
#> [61] pillar_1.11.0 geometry_0.5.2 plotly_4.11.0
#> [64] generics_0.1.4 vroom_1.6.5 rprojroot_2.1.1
#> [67] hms_1.1.3 commonmark_2.0.0 scales_1.4.0
#> [70] xtable_1.8-4 RhpcBLASctl_0.23-42 glue_1.8.0
#> [73] lazyeval_0.2.2 tools_4.4.1 data.table_1.16.4
#> [76] fs_1.6.6 grid_4.4.1 tidyr_1.3.1
#> [79] crosstalk_1.2.2 devtools_2.4.5 patchwork_1.3.2
#> [82] cli_3.6.5 textshaping_1.0.1 viridisLite_0.4.2
#> [85] svglite_2.2.1 gtable_0.3.6 sass_0.4.10
#> [88] digest_0.6.37 htmlwidgets_1.6.4 farver_2.1.2
#> [91] memoise_2.0.1 htmltools_0.5.8.1 lifecycle_1.0.4
#> [94] httr_1.4.7 statmod_1.5.0 mime_0.13
#> [97] bit64_4.6.0-1 MASS_7.3-65
Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98(463), 750-763.
James, G. M., & Sugar, C. A. (2003). Clustering for sparsely sampled functional data. Journal of the American Statistical Association, 98(462), 397-408.
Ramsay, J. O., & Silverman, B. W. (2005). Functional data analysis. Springer.
Ferraty, F., & Vieu, P. (2006). Nonparametric functional data analysis: theory and practice. Springer.
A few troubleshooting tips:

- Check that the time series file contains the required columns (`subjID`, `measureID`, `time`, `value`)
- Try a lower spline dimension (`p`) if clustering fails
- Adjust the `cores` parameter to match your system

This guide was generated using the MultiConnector package. For updates and additional resources, visit the package documentation.